Characteristic Length Scale of Electric Transport Properties of Genomes 
A tight-binding model together with a novel statistical method is used
to investigate the relation between the sequence-dependent electric
transport properties and the sequences of protein-coding regions of
complete genomes. A correlation parameter $\Omega$ is defined to analyze
the relation. For some particular propagation length $w_{\max}$, the
transport behaviors of the coding and non-coding sequences are very
different and the correlation reaches its maximal value $\Omega_{\max}$.
$w_{\max}$ and $\Omega_{\max}$ are characteristic values for each species.
A possible reason for the difference between the transport properties of
the coding and non-coding regions is the mechanism of DNA damage repair
processes together with natural selection.

PACS numbers: 87.15.Aa, 87.14.Gg, 72.80.Le 



The conductance of DNA molecules is one of the central problems of
biophysics because it plays a critical role in biological systems. For
example, it is postulated that there may be proteins which can locate
DNA damage by detecting the long-range electron migration properties
[1, 2]. From the applications side, DNA is one of the most promising
candidates to serve as a building block of molecular electronics because
of its sequence-dependent and self-assembly properties.

There have been many experimental results on the conductance of DNA
from different measurements over the last few years, yet the results are
still highly controversial. They cover almost all possibilities, ranging
from insulating and semiconducting to Ohmic, and even induced
superconductivity [8]. The diversity comes from the methods of
measurement and the preparation of the DNA samples. One of the critical
factors influencing the results is the contact between the DNA and the
electrodes. The different nucleotide sequences of the DNA molecules used
in the experiments also diversify the results because the transport
properties are sequence-dependent.

Aside from the electrical properties, the statistical features of the
symbolic sequences of DNA have also been studied intensely in past
years. Previous works mainly focused on the correlations and linguistic
properties of the symbols A, T, C, and G, which represent the four kinds
of nucleotide bases adenine, thymine, cytosine, and guanine,
respectively. The analyses also give some peculiar results. For example,
the statistical behavior of the intron-free coding sequences is similar
to that of random sequences, while the intron-rich or junk sequences
have long-range correlations. One should note that these statistical
properties of the symbolic sequences are the result of evolution, and
that the underlying driving forces are the principles of physics and
chemistry. Conversely, the correlations of the sequences influence the
physical and chemical properties of DNA, such as its electric and
mechanical properties. Thus it is reasonable to conjecture that the
sequence-dependent electric properties can play critical roles during
evolution in nature, for example through the DNA damage repair processes
[1, 2]. In this Letter, the relation between the electric transport
properties and the gene-coding/noncoding parts of genomic sequences is
discussed.

The simplest effective tight-binding Hamiltonian for a hole propagating
in the DNA chain can be written as

$$ H = \sum_n \epsilon_n c_n^\dagger c_n + \sum_n t_{n,n+1}\left(c_n^\dagger c_{n+1} + \mathrm{H.c.}\right), \tag{1} $$

where each lattice point represents a nucleotide base of the chain.
$c_n^\dagger$ ($c_n$) is the creation (destruction) operator of a hole at
the $n$-th site. $\epsilon_n$ is the potential energy at the $n$-th site,
which is determined by the ionization potential of the corresponding
nucleotide: $\epsilon_n$ equals 8.24 eV, 9.14 eV, 8.87 eV, and 7.75 eV
for $n$ = A, T, C, and G, respectively. The DNA molecule is assumed to be
connected between two semi-infinite electrodes with energy
$\epsilon_m = \epsilon_G = 7.75$ eV. The hopping integral is
$t_{n,n+1} = t_m = 1$ eV for the electrodes and
$t_{n,n+1} = t_{\mathrm{DNA}}$ for the nucleotides. $t_{\mathrm{DNA}}$ is
assumed to be nucleotide-independent here for simplicity. Typical values
from first-principles calculations are $t_{\mathrm{DNA}}$ = 0.1 to 0.4 eV
[21, 22]. To reduce the backscattering effect at the contacts, larger
$t_{\mathrm{DNA}}$ (up to 1 eV) is also used in this study [19]. Note
that $n \in (-\infty, 1]$ and $n \in [N+1, \infty)$ label the electrode
sites and $n \in [2, N]$ labels the nucleotides.

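The eigenstates of the Hamiltonian, $|\Psi\rangle = \sum_n a_n |n\rangle$
(where $|n\rangle$ represents the state in which the hole is located at
the $n$-th site), can be solved exactly by using the transfer matrix
method:

$$ \begin{pmatrix} a_{n+1} \\ a_n \end{pmatrix} = M_n \begin{pmatrix} a_n \\ a_{n-1} \end{pmatrix}, \tag{2} $$

where

$$ M_n = \begin{pmatrix} \dfrac{E-\epsilon_n}{t_{n,n+1}} & -\dfrac{t_{n-1,n}}{t_{n,n+1}} \\ 1 & 0 \end{pmatrix}. \tag{3} $$

$E$ is the energy of the injected hole. In the electrodes, the wave
functions are plane waves and the dispersion of the hole is
$E = \epsilon_m + 2t_m\cos k$, so the range of possible $E$ is
$[\epsilon_m - 2t_m, \epsilon_m + 2t_m]$ = [5.75 eV, 9.75 eV]. The
transmission coefficient $T(E)$ is then obtained from the product of the
transfer matrices, in the form given in Ref. [23].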
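The recursion behind the transfer matrix can be turned into a short
numerical sketch. Projecting $H|\Psi\rangle = E|\Psi\rangle$ onto site $n$
gives $t_{n-1,n} a_{n-1} + \epsilon_n a_n + t_{n,n+1} a_{n+1} = E a_n$,
which the code below iterates backwards from a single plane wave in the
right electrode. This is not the closed form of Ref. [23] but a standard
plane-wave matching that is expected to agree with it for this model. The
function name `transmission`, the use of $t_{\mathrm{DNA}}$ for the two
contact bonds, and the example sequences are assumptions made for
illustration only.

```python
import cmath
import math

# On-site energies (eV) quoted in the text for each base and for the electrodes.
SITE_ENERGY = {"A": 8.24, "T": 9.14, "C": 8.87, "G": 7.75}
EPS_M = 7.75   # electrode on-site energy (eV)
T_M = 1.0      # electrode hopping integral (eV)


def transmission(seq, energy, t_dna=1.0):
    """Transmission of a hole of the given energy through the bases in seq."""
    x = (energy - EPS_M) / (2.0 * T_M)
    if not -1.0 < x < 1.0:
        raise ValueError("energy must lie strictly inside the electrode band")
    k = math.acos(x)  # from E = eps_m + 2 t_m cos k

    # Two electrode sites on each side of the DNA segment.
    eps = [EPS_M, EPS_M] + [SITE_ENERGY[s] for s in seq] + [EPS_M, EPS_M]
    # hop[j] couples site j and site j + 1; contact bonds use t_dna (assumption).
    hop = [T_M] + [t_dna] * (len(seq) + 1) + [T_M]

    # Right electrode carries a single plane wave of unit amplitude.
    a = [0j] * len(eps)
    a[-2] = 1.0 + 0j
    a[-1] = cmath.exp(1j * k)

    # Backward recursion: t_{j-1,j} a_{j-1} = (E - eps_j) a_j - t_{j,j+1} a_{j+1}.
    for j in range(len(eps) - 2, 0, -1):
        a[j - 1] = ((energy - eps[j]) * a[j] - hop[j] * a[j + 1]) / hop[j - 1]

    # Amplitude of the matching plane wave on the two left-electrode sites.
    # The recursion is real, so by complex conjugation T = 1 / |amplitude|^2.
    amplitude = (a[1] - a[0] * cmath.exp(-1j * k)) / (2j * math.sin(k))
    return 1.0 / abs(amplitude) ** 2


# Illustrative calls (sequences chosen arbitrarily).
t_uniform = transmission("GGGGGGGG", 7.0, t_dna=1.0)
t_mixed = transmission("ATCGGCTA", 8.0, t_dna=0.4)
```

For a homogeneous poly(G) segment with $t_{\mathrm{DNA}} = 1$ eV the
chain is identical to the electrodes, so the sketch should return a
transmission of 1 up to rounding, which is a convenient sanity check. For
very long segments with small $t_{\mathrm{DNA}}$ the amplitudes grow
exponentially in the backward recursion, so a production code would need
rescaling.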
The transmission of several sequences of complete genomes
$S = (s_1, s_2, \cdots, s_{N_{\mathrm{tot}}})$ is studied ($s_i$ = A, T,
C, or G). Since the total length $N_{\mathrm{tot}}$ of a complete genome
is usually much longer than the distance over which holes can migrate
along the DNA chain, even for the smallest $N_{\mathrm{tot}}$ of viruses,
the transmission is not measured through the whole chain but through
shorter segments instead. A "window" of width $w$ is defined to extract
a segment $S_{i,w} = (s_i, s_{i+1}, \cdots, s_{i+w-1})$ from $S$ for
$1 \le i \le N_w = N_{\mathrm{tot}} - w + 1$. Starting from $i = 1$ and
sliding the window, we get the "transmission sequence" $T_w(E, i)$ of
$S_{i,w}$ for all $i$, which depends on the energy of the injected hole
$E$, the starting position of the segment $i$, and the propagation
length $w$. For further analysis of the whole genome sequences,
$T_w(E, i)$ is integrated over an energy interval $[E, E+\Delta E]$:

$$ \bar{T}_w(E, \Delta E, i) = \int_E^{E+\Delta E} T_w(E', i)\, dE'. \tag{5} $$
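The windowing and the energy integration of Eq. (5) can be sketched as
follows. The helper name `integrated_transmission`, the midpoint
quadrature, and the number of energy points are assumptions for
illustration; `transmission_fn` stands for any routine that returns
$T_w(E, i)$ for one segment and one energy, and the constant function in
the usage lines is only a placeholder.

```python
def integrated_transmission(seq, w, transmission_fn,
                            e_lo=5.75, delta_e=4.0, n_energy=200):
    """Return the list of energy-integrated transmissions, one per window.

    Window starts are 0-based here, so entry i corresponds to the segment
    seq[i:i + w]; there are len(seq) - w + 1 windows in total.
    """
    de = delta_e / n_energy
    # Midpoint rule: never evaluates exactly at the band edges.
    energies = [e_lo + (m + 0.5) * de for m in range(n_energy)]
    result = []
    for i in range(len(seq) - w + 1):
        segment = seq[i:i + w]
        total = sum(transmission_fn(segment, e) for e in energies)
        result.append(total * de)
    return result


# Placeholder transmission that is 1 everywhere, just to exercise the loop.
toy = integrated_transmission("ATCGATCGAT", 4, lambda segment, e: 1.0)
```

The cost of this direct approach scales as the number of windows times
the window width times the number of energy points, so for a full
chromosome one would likely want to vectorize or parallelize the loop.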

In the remainder of this Letter, the transmission is integrated over the
whole bandwidth, that is, $E = 5.75$ eV and $\Delta E = 4$ eV, and these
two values are omitted from the related formulas for brevity. The 300
base pairs at each end of the DNA chain are also omitted from the
following analysis because the telomere sequences at the terminals
usually have larger transmission (due to their periodicity) and would
dominate some of the average properties. Thus
$N_w = N_{\mathrm{tot}} - w + 1 - 2 \times 300$.

FIG. 1: $\bar{T}_w^{\mathrm{ave}}$ for Y3 with $t_{\mathrm{DNA}}$ = 1.0 eV
(full circles), 0.9 eV (open circles), 0.8 eV (full triangles), 0.7 eV
(open triangles), 0.6 eV (full squares), 0.5 eV (open squares), and
0.4 eV (diamonds). Solid, dotted, dashed, and dash-dotted lines are for a
random sequence R3 with $t_{\mathrm{DNA}}$ = 1.0, 0.9, 0.8, and 0.4 eV,
respectively. (Inset) Localization length $w_0$ of Y3 (full circles) and
R3 (open circles) for each $t_{\mathrm{DNA}}$ (see text).

The averaged transmission
$\bar{T}_w^{\mathrm{ave}} = \frac{1}{N_w}\sum_i \bar{T}_w(i)$ versus
propagation length $w$ is plotted in Fig. 1 for the third chromosome of
Saccharomyces cerevisiae (baker's yeast, GenBank accession number
NC_001135 [24], denoted Y3 for short) with several values of
$t_{\mathrm{DNA}}/t_m$. $\bar{T}_w^{\mathrm{ave}}$ decreases exponentially
with increasing $w$, which is consistent with the localization picture.
The curves can be fitted by the function
$\bar{T}_w^{\mathrm{ave}} = a e^{-w/w_0}$. The inset of Fig. 1 shows the
averaged localization length $w_0$ for each $t_{\mathrm{DNA}}$. Note that
this is an averaged result over the complete genome, and the possibility
of high conductance of some particular segments is not ruled out. Other
important features are that $\bar{T}_w(i)$ decreases faster for smaller
$t_{\mathrm{DNA}}$, and that $w_0$ is nearly proportional to
$t_{\mathrm{DNA}}$. The reason is that the backscattering is stronger for
smaller $t_{\mathrm{DNA}}$. Although smaller $t_{\mathrm{DNA}}$ values
(< 0.4 eV) are more physical, the signal revealing the intrinsic
properties of the sequences may be smeared out by the strong
backscattering. $\bar{T}_w^{\mathrm{ave}}$ for a random sequence R3 with
the same length and the same ratios of the four bases as Y3 is also
shown by the lines in Fig. 1. It is clear that the transmission of the
random sequence decreases faster (smaller $w_0$) than that of the
natural genome because of the larger disorder. This result is consistent
with an earlier report.
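The exponential fit used to extract the localization length $w_0$ can be
sketched as a linear least-squares fit of $\ln \bar{T}_w^{\mathrm{ave}}$
against $w$. The function name `fit_localization_length` and the use of
a log-linear fit (rather than a nonlinear fit of the raw curve) are
assumptions, and the data in the usage lines are synthetic, not values
from the figures.

```python
import math


def fit_localization_length(widths, t_ave):
    """Fit ln T = ln a - w / w0 by least squares and return (a, w0)."""
    ys = [math.log(t) for t in t_ave]
    n = len(widths)
    mean_x = sum(widths) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in widths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(widths, ys))
    slope = sxy / sxx                 # equals -1 / w0
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope


# Synthetic, noise-free curve with a = 0.5 and w0 = 35 (made-up numbers).
ws = list(range(20, 220, 20))
ts = [0.5 * math.exp(-w / 35.0) for w in ws]
a_fit, w0_fit = fit_localization_length(ws, ts)
```

A log-linear fit weights small transmissions more heavily than a direct
fit would, so with noisy data the two approaches may give slightly
different $w_0$.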

Since the transport properties are related to the DNA damage repair
mechanism, there could be a correlation between the locations of genes
and the corresponding integrated transmission $\bar{T}_w(i)$. In Fig. 2,
$\bar{T}_{240}(i)$ and the coding regions are compared for part of the
sequence of Y3. Most of the sharp peaks of $\bar{T}_{240}(i)$ appear to
be located in the protein-coding regions.

FIG. 2: Comparison of $\bar{T}_{240}(i)$ (line, $t_{\mathrm{DNA}}$ = 1 eV)
and the coding regions (shaded area) in the range from the 5000-th to
the 30000-th nucleotide of Y3. (Inset) Enlarged plot from the 22000-th
to the 24000-th nucleotide.

FIG. 3: $\Omega(w)$ for $t_{\mathrm{DNA}}/t_m$ = 1.0 (circles), 0.8
(triangles), 0.6 (squares), and 0.4 (diamonds) of Y3. (Inset)
$\Omega_{\max}$ (full circles) and $w_{\max}$ (open circles) as functions
of $t_{\mathrm{DNA}}$.

To check this correlation in a more quantitative way, a binary "coding
sequence" is first defined, with $G(i) = 1$ (0) if the $i$-th nucleotide
is in the protein-coding (noncoding) region. $G(i)$ and $\bar{T}_w(i)$
are then normalized in the following way:

$$ G'(i) = G(i) - \frac{1}{N_w}\sum_j G(j), \qquad g(i) = \frac{G'(i)}{\sqrt{\sum_j G'(j)^2}}, $$

and

$$ T'_w(i) = \bar{T}_w(i) - \frac{1}{N_w}\sum_j \bar{T}_w(j), \qquad \tau_w(i) = \frac{T'_w(i)}{\sqrt{\sum_j T'_w(j)^2}}. \tag{6} $$

The overlap between these two normalized sequences is defined as

$$ \Omega(w) = \sum_i g(i)\, \tau_w(i). \tag{7} $$
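The construction of the coding sequence and of the overlap can be
sketched in a few lines. The sketch assumes that the normalization is a
mean subtraction followed by scaling to unit norm, so that the overlap
is a Pearson-type correlation between -1 and 1; the function names
`coding_mask`, `normalize`, and `overlap`, the 0-based half-open
intervals, and the numbers in the usage lines are all hypothetical.

```python
import math


def coding_mask(n, coding_intervals):
    """G(i): 1 inside any (start, end) interval (0-based, end exclusive), else 0."""
    g = [0] * n
    for start, end in coding_intervals:
        for i in range(max(start, 0), min(end, n)):
            g[i] = 1
    return g


def normalize(values):
    """Subtract the mean and scale the result to unit Euclidean norm."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("a constant sequence cannot be normalized")
    return [c / norm for c in centered]


def overlap(g_raw, t_bar):
    """Overlap of the normalized coding sequence and integrated transmission."""
    g = normalize(g_raw)
    t = normalize(t_bar)
    return sum(a * b for a, b in zip(g, t))


# Made-up example: transmission is high exactly where the mask is 1.
mask = coding_mask(6, [(2, 5)])
omega = overlap(mask, [0.1, 0.1, 0.8, 0.8, 0.8, 0.1])
```

With this normalization a positive overlap means that high transmission
coincides with coding positions, and a negative overlap means the
opposite, which matches the way the sign is interpreted in the text.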



In Fig. 3, $\Omega(w)$ for Y3 is shown for different $t_{\mathrm{DNA}}$.
For $t_{\mathrm{DNA}} = 1$ eV, there is a maximum at $w_{\max} = 240$
with $\Omega_{\max} = 0.103$. Note that $\Omega_{\max}$ denotes the value
of $\Omega(w)$ with the maximal absolute value and can be positive or
negative. The strong positive overlap implies that the holes can move
more freely in the coding regions. As $t_{\mathrm{DNA}}$ decreases, both
$\Omega_{\max}$ and $w_{\max}$ decrease. For $t_{\mathrm{DNA}} < 0.5$ eV,
the overlap becomes negative, which means that the electronic
conductance is poorer in the coding regions. The dependence of
$\Omega_{\max}$ and $w_{\max}$ on $t_{\mathrm{DNA}}$ is shown in the
inset of Fig. 3. Although the values of $w_{\max}$ and $\Omega_{\max}$
vary with $t_{\mathrm{DNA}}$, $G(i)$ and $\bar{T}_w(i)$ are correlated in
general.
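Picking the characteristic pair from a scan over window widths is a
one-line search for the largest absolute overlap while keeping its sign.
The function name `characteristic_length` and the dictionary layout are
assumptions, and the scan values below are invented for illustration.

```python
def characteristic_length(omega_by_w):
    """Return (w_max, omega_max), where omega_max has the largest |value|."""
    w_max = max(omega_by_w, key=lambda w: abs(omega_by_w[w]))
    return w_max, omega_by_w[w_max]


# Hypothetical scan: overlap values keyed by window width w.
scan = {40: 0.01, 80: -0.12, 120: 0.05}
w_max, omega_max = characteristic_length(scan)
```

Keeping the sign matters because a negative value is read as poorer
conductance in the coding regions rather than as a weaker correlation.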

$\Omega(w)$ with $t_{\mathrm{DNA}} = 1$ eV is shown for several genomes
in Fig. 4. It can be seen that there is a maximal positive or negative
overlap $\Omega_{\max}$ at some "characteristic migration length"
$w_{\max}$ for each genome. $\Omega(w)$ for yeast chromosomes III, VIII,
and X, and for Ureaplasma parvum serovar 3 str. ATCC 700970, is
positive, which means that the coding regions have larger conductance.
On the other hand, $\Omega(w)$ for Acinetobacter sp. ADP1, Deinococcus
radiodurans R1 chromosome II, and Chlamydia trachomatis D/UW-3/CX is
negative, which means that the coding regions have smaller conductance.
($\Omega_{\max}$, $w_{\max}$) for these genomes are summarized in
Table I.
FIG. 4: $\Omega(w)$ for several genomes: chromosomes III (full circles),
VIII (open circles), and X (full triangles) of yeast, Ureaplasma parvum
serovar 3 str. ATCC 700970 (full diamonds), Acinetobacter sp. ADP1 (full
squares), Deinococcus radiodurans R1 chromosome II (open triangles), and
Chlamydia trachomatis D/UW-3/CX (open squares). Red circles with error
bars are the averaged $\Omega(w)$ for 10 randomized sequences of yeast
chromosome III (see text).

TABLE I: $\Omega_{\max}$ and $w_{\max}$ for the genomes studied in Fig. 4.

Genome                                        Access No.  $w_{\max}$  $\Omega_{\max}$
Yeast III                                     NC_001135   240         0.103
Yeast VIII                                    NC_001140   200         0.077
Yeast X                                       NC_001142   170         0.085
Ureaplasma parvum serovar 3 str. ATCC 700970  NC_002162   130         0.041
Acinetobacter sp. ADP1                        NC_005966   80          -0.129
Deinococcus radiodurans R1 chromosome II      NC_001264   80          -0.149
Chlamydia trachomatis D/UW-3/CX               NC_000117   50          -0.075

To ensure that the $\Omega(w)$ shown above are physically and
biologically meaningful, we compare the results with random sequences.
Ten sequences generated in the same way as R3 are analyzed, and the
averaged $\Omega(w)$ (overlap with the $g(i)$ of Y3) is shown in Fig. 4
(open circles with error bars). Its overlap is clearly about one order
of magnitude smaller than that of the real sequences. So, from the above
comparison, $\Omega_{\max}$ and $w_{\max}$ are not artifacts but
intrinsic properties of genomes.
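A randomized control of this kind can be sketched as follows. The text
only states that the random sequences have the same length and the same
base ratios as the genome; generating them by shuffling the genome, as
done here, is an assumption, as are the names `shuffled_control` and
`mean_and_error` and the use of the sample standard deviation for the
error bars.

```python
import random


def shuffled_control(seq, rng):
    """Random sequence with exactly the same length and base counts as seq."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def mean_and_error(values):
    """Mean and sample standard deviation of the control overlaps."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance ** 0.5


# Ten controls for a short made-up sequence, with a fixed seed for repeatability.
rng = random.Random(0)
controls = [shuffled_control("ATCGGGATTACA", rng) for _ in range(10)]
```

Each control would then be passed through the same transmission and
overlap pipeline as the real genome, and `mean_and_error` applied to the
ten resulting overlaps at each window width.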

From the analysis above, it can be concluded that $w_{\max}$ is a
characteristic length scale of the electric transport at which the
gene-coding regions can be distinguished, and $\Omega_{\max}$ stands for
the "sensitivity" of this probing process.

A possible biological reason for these correlations is the mechanism of
DNA damage repair processes. Since proteins may use the transport
properties to probe the location of DNA damage [1, 2], the transport of
the coding areas should have particular features suited to the detecting
processes, while those of the non-coding regions are somewhat
irrelevant.

Fig. 4 shows two important features of $\Omega_{\max}$. First, each
species has its own characteristic values ($w_{\max}$, $\Omega_{\max}$).
It can be postulated that the mechanisms for detecting DNA defects
differ between species because of their various biological and
environmental features. Second, ($w_{\max}$, $\Omega_{\max}$) of the
different chromosomes of the same species (yeast here) are very similar
because they are in the same environment and hence share the same DNA
damage repair mechanism.

It should be noted that the model used in this study is an
oversimplified one. However, one of the most important properties can
still be extracted from this coarse-grained model: the coding regions
have very different transport behavior from the noncoding parts at the
characteristic length scale $w_{\max}$, and each species has a different
$w_{\max}$ adapted to its environment. In the future, the model will be
refined by introducing more realistic interactions such as
base-dependent hopping [27], sequence-dependent potentials [28], and
charge-charge interactions.

In summary, with a new method combining the transfer matrix approach and
symbolic sequence analysis, the correlation between the transport
properties and the positions of genes is studied for complete genomes.
There are two characteristic values, $\Omega_{\max}$ and $w_{\max}$, for
each genome. These two values can provide information for taxonomy or
about the mechanism of evolution.